This analysis examines variant positions across lineages of the kelp complex Alaria marginata in the Gulf of Alaska and British Columbia, Canada. The complex was originally recognized as four separate species before they were folded into Alaria marginata on the basis of DNA barcode and morphological surveys. Here we revisit patterns of biodiversity at the genomic level to determine whether the complex is truly a single species (as suggested by Lane et al. 2007), a set of incipient lineages (as suggested by Grant and Bringloe 2020), or whether the original morphospecies were indeed correct.
In addition, we explore how parameter choices for variant filtering prior to analysis affect the results. As detailed below, we set parameters under relaxed (emphasizing data quantity), moderate (best practice), and strict (prioritizing confidence and strict homology) perspectives. As the following figures show, phylogenetic signal is very consistent across filtering schemes, although relative genetic distances shift with the choices; population genetic statistics are affected in absolute value, but relative patterns are generally consistent. Crucially, missing data is known to impact several of these estimates, because missing data is typically treated as the reference allele in calculations. Another set of analyses is slated to correct for missingness and its effect on downstream values, particularly genetic distances and nucleotide diversity. Using corrected distances, we will determine whether there is a species “gap” among the A. marginata lineages, using context from other species and North Atlantic populations.
##Mapping parameters for bowtie2: 0.6=up to 10% divergence when mapping high-quality reads in end-to-end mode, 0.3=up to 5%, 0.12=up to 2%; per the bowtie2 manual, a value of 0.6 sets the minimum-score function f to f(x) = 0 + -0.6 * x, where x is the read length.
map_param_nuc=0.6
map_param_mito=0.3
map_param_chloro=0.12
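The divergence percentages quoted for the mapping parameters can be checked with a little arithmetic. The sketch below assumes bowtie2's default maximum mismatch penalty of 6 (the MX value of `--mp`) and that all divergence comes from high-quality mismatches (no gaps or Ns); the function names are illustrative and not part of the pipeline.

```python
# Assumption: bowtie2's default maximum mismatch penalty (--mp MX) of 6 applies,
# and divergence comes only from high-quality mismatches (no gaps, no Ns).
MAX_MISMATCH_PENALTY = 6


def min_score(slope, read_length):
    """Minimum end-to-end alignment score f(x) = 0 + -slope * x."""
    return -slope * read_length


def max_divergence(slope, penalty=MAX_MISMATCH_PENALTY):
    """Largest fraction of mismatched bases that still meets the minimum score."""
    return slope / penalty


# Values taken from the parameter lines above.
for label, slope in [("nuc", 0.6), ("mito", 0.3), ("chloro", 0.12)]:
    flag = f"--score-min L,0,-{slope}"
    print(f"{label}: {flag} -> up to {max_divergence(slope):.0%} divergence")
```

For a 150 bp read, a slope of 0.6 gives a minimum score of about -90, which 15 mismatches at a penalty of 6 each would just reach.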
##User inputs following filtering settings for compiling nuclear VCF file
##numbers represent parameters for relaxed; moderate; strict filtering analyses
min_cov=5; 15; 25
max_cov=100; 100; 100
allelic_balance_low=0.2; 0.2; 0.2
allelic_balance_high=5; 5; 5
minor_allele_frequency=0.01; 0.01; 0.02
min_Q=30 #follows phred scale, 30=1/1000 chance of SNP calling error
max_missing=0.7; 0.85; 0.97 # value is the proportion of present (non-missing) data required to keep a SNP site, e.g. 0.9=at most 10% missingness
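To make the three filtering perspectives easier to compare, the sketch below collects the values that differ between them and converts two settings into more intuitive quantities. The dictionary layout and function names are hypothetical, and treating the missingness threshold as inclusive (`>=`) is an assumption rather than a statement about how any particular tool applies it.

```python
# Parameter values copied from the settings above; the structure is illustrative.
PRESETS = {
    "relaxed":  {"min_cov": 5,  "maf": 0.01, "max_missing": 0.70},
    "moderate": {"min_cov": 15, "maf": 0.01, "max_missing": 0.85},
    "strict":   {"min_cov": 25, "maf": 0.02, "max_missing": 0.97},
}
MIN_Q = 30


def phred_to_error(q):
    """Convert a phred-scaled quality into an error probability."""
    return 10 ** (-q / 10)


def allowed_missingness(max_missing):
    """Fraction of samples that may lack a genotype at a retained site."""
    return 1.0 - max_missing


def site_passes(n_called, n_samples, max_missing):
    """Assumed inclusive rule: keep the site if the call rate meets the threshold."""
    return n_called / n_samples >= max_missing


print(f"min_Q={MIN_Q} -> error probability {phred_to_error(MIN_Q):.4f}")
for name, params in PRESETS.items():
    missing = allowed_missingness(params["max_missing"])
    print(f"{name}: up to {missing:.0%} missing genotypes per site")
```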
##User specifies following parameters for removal of linked SNPs using PLINK
plink_r2=0.25
plink_window_size=25 #in kb
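As a rough illustration of what the two PLINK settings control, the sketch below computes r2 as the squared Pearson correlation between two vectors of genotype allele counts (0, 1, 2) and flags a SNP pair as linked when it falls inside the window and exceeds the threshold. This is a simplified stand-in written for illustration, not PLINK's own pruning procedure; the function names are hypothetical.

```python
def r2(x, y):
    """Squared Pearson correlation between two genotype allele-count vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a monomorphic site carries no LD information
    return (cov * cov) / (var_x * var_y)


def is_linked(pos_a, pos_b, geno_a, geno_b, window_kb=25, r2_max=0.25):
    """True if two SNPs lie within the window and exceed the r2 threshold."""
    within_window = abs(pos_a - pos_b) <= window_kb * 1000
    return within_window and r2(geno_a, geno_b) > r2_max


# Toy genotypes for four individuals at two nearby sites (made-up values).
snp1 = [0, 1, 2, 0]
snp2 = [0, 1, 2, 0]
print(is_linked(1_000, 5_000, snp1, snp2))
```

In a pruning pass, one SNP from each linked pair would be dropped so that downstream analyses such as PCA and admixture see approximately independent markers.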
PCA plots
Admixture plot for moderate analysis incl. ML tree
##results were consistent across analyses. Note cross validation error supports k=3-4 ancestral populations.
Phylogenetic networks w/ outgroup
Organellar phylogenetic ML trees
Heterozygosity/inbreeding coefficients
##Negative values suggest outbreeding, positive values suggest inbreeding
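The sign convention can be seen from the method-of-moments estimator that tools such as vcftools report with `--het`, F = (O - E) / (N - E), where O is the observed number of homozygous sites, E the expected number, and N the number of sites. Whether this exact estimator was used here is an assumption; the counts below are made up for illustration.

```python
def inbreeding_f(obs_hom, exp_hom, n_sites):
    """Method-of-moments inbreeding coefficient F = (O - E) / (N - E)."""
    if n_sites == exp_hom:
        raise ValueError("F is undefined when expected homozygosity equals N")
    return (obs_hom - exp_hom) / (n_sites - exp_hom)


# Hypothetical individuals scored at 100 sites with 70 homozygous sites expected.
print(inbreeding_f(80, 70, 100))  # more homozygous than expected: F > 0
print(inbreeding_f(60, 70, 100))  # fewer homozygous than expected: F < 0
```

An excess of homozygous sites gives a positive F (inbreeding), while an excess of heterozygous sites gives a negative F (outbreeding).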
Nucleotide diversity using a 50kb sliding window
##Values are sensitive to missing data and need to be corrected (there is a downward bias because missing data is treated as the reference allele by vcftools); this explains the homogenization of values in the strict analysis (because most sites, and hence most missing data, are removed)
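The direction of the bias can be illustrated with a toy site. The sketch below computes per-site nucleotide diversity as the proportion of allele pairs that differ, once with missing calls recoded as the reference allele (the behaviour described in the note above) and once using called alleles only. The data are invented and the functions are illustrative, not the planned correction.

```python
from itertools import combinations


def site_pi(alleles):
    """Proportion of allele pairs that differ at one site (0 = ref, 1 = alt)."""
    pairs = list(combinations(alleles, 2))
    if not pairs:
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)


# Toy site: six sampled alleles, two of them missing (None).
calls = [1, 1, None, None, 0, 0]

as_reference = [0 if a is None else a for a in calls]  # missing recoded as ref
called_only = [a for a in calls if a is not None]      # missing dropped

print(f"missing as reference: {site_pi(as_reference):.3f}")
print(f"called alleles only:  {site_pi(called_only):.3f}")
```

In this example, recoding missing calls as reference lowers the alternate allele frequency and with it the diversity estimate, which is the downward bias noted above.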
Tajima's D test for neutrality using a 25kb sliding window
##Values are consistent, but prone to noise when fewer data go into the calculation (i.e. strict). Note: values >0 suggest a recent bottleneck (either through migration or selection); values <0 suggest population expansion.
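For reference, the statistic compares two estimates of diversity: the mean number of pairwise differences and Watterson's estimate based on the number of segregating sites. The sketch below implements the standard formula (Tajima 1989) for a single window; it is a minimal illustration with hypothetical inputs and does not handle missing data.

```python
import math


def tajimas_d(n, seg_sites, pi):
    """Tajima's D from sample size n, segregating sites S, and pairwise diversity pi.

    pi is the mean number of pairwise differences for the window (not per site).
    """
    if n < 4:
        raise ValueError("the variance term is zero for n < 4")
    if seg_sites == 0:
        return float("nan")  # undefined without segregating sites

    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n ** 2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)

    theta_w = seg_sites / a1  # Watterson's estimate
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - theta_w) / math.sqrt(variance)


# Hypothetical window: 10 sampled sequences, 10 segregating sites.
print(tajimas_d(10, 10, 5.0))  # pi above Watterson's estimate: D > 0
print(tajimas_d(10, 10, 2.0))  # pi below Watterson's estimate: D < 0
```

Because the variance term depends on the number of segregating sites, windows with few sites after strict filtering give noisier values.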
Organellar nucleotide diversity and Tajima's D
##organellar pop gen stats
Linkage disequilibrium decay
##The higher levels of LD in Victoria and Lowell Point suggest these populations are more recently established; nucleotide diversity was also relatively low in these populations; signatures of outbreeding suggest these areas were recently established from multiple source populations; this is especially evident in the organellar trees, where Victoria features two well-supported lineages
##done!